Author: Yiming Yang, Rimte Rocher
Date: 2022-03-07
Notebook Source: regress_out.ipynb
This tutorial runs analysis on a 10x Visium mouse brain section dataset with Pegasus.
import numpy as np
import pandas as pd
import pegasusio as io
import pegasus as pg
You can download the data at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/mouse_brain_10x.tar.gz.
After downloading, unzip the tar ball file and load the data into memory:
data = io.read_input('mouse_brain_10x', file_type='visium')
2022-03-07 14:30:29,082 - pegasusio.readwrite - INFO - visium file 'mouse_brain_10x' is loaded. 2022-03-07 14:30:29,084 - pegasusio.readwrite - INFO - Function 'read_input' finished in 1.37s.
data
MultimodalData object with 1 UnimodalData: 'mm10-visium'
It currently binds to SpatialData object mm10-visium
SpatialData object with n_obs x n_vars = 2702 x 32285
Genome: mm10; Modality: visium
It contains 1 matrix: 'X'
It currently binds to matrix 'X' as X
obs: 'in_tissue', 'array_row', 'array_col'
var: 'featureid'
obsm: 'X_spatial'
uns: 'genome', 'modality'
img: 'sample_id', 'image_id', 'data', 'scale_factor', 'spot_diameter'
pg.qc_metrics(data, mito_prefix='mt-')
Then we can view the metrics. For example, the code below shows the 2.5% and 97.5% quantiles for number of genes:
np.percentile(data.obs['n_genes'], [2.5, 97.5])
array([2917.675, 8664.475])
Based on the quantiles above, we filter the data by number of genes between 2.5% and 97.5%, and percent of mito gene expression below 20%:
pg.qc_metrics(data, min_genes = 2917, max_genes = 8664, mito_prefix='mt-', percent_mito=20)
pg.qcviolin(data, plot_type='gene', dpi=100)
pg.qcviolin(data, plot_type='count', dpi=100)
pg.qcviolin(data, plot_type='mito', dpi=100)
Now do the actual filteration below:
pg.filter_data(data)
2022-03-07 14:30:30,259 - pegasusio.qc_utils - INFO - After filtration, 2229 out of 2702 cell barcodes are kept in UnimodalData object mm10-visium. 2022-03-07 14:30:30,260 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.20s.
And identify robust genes:
pg.identify_robust_genes(data)
2022-03-07 14:30:30,555 - pegasus.tools.preprocessing - INFO - After filtration, 21680/32285 genes are kept. Among 21680 genes, 20184 genes are robust. 2022-03-07 14:30:30,555 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.29s.
pg.log_norm(data)
pg.highly_variable_features(data)
pg.pca(data)
pg.neighbors(data)
pg.leiden(data)
pg.umap(data)
2022-03-07 14:30:30,775 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.21s. 2022-03-07 14:30:30,801 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.03s. 2022-03-07 14:30:30,856 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2022-03-07 14:30:30,857 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.08s. 2022-03-07 14:30:31,535 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 0.68s. 2022-03-07 14:30:31,825 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.29s. 2022-03-07 14:30:31,912 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 0.09s. 2022-03-07 14:30:32,000 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 0.05s. 2022-03-07 14:30:32,220 - pegasus.tools.clustering - INFO - Leiden clustering is done. Get 11 clusters. 2022-03-07 14:30:32,225 - pegasus.tools.clustering - INFO - Function 'leiden' finished in 0.31s. 2022-03-07 14:30:32,227 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required. 2022-03-07 14:30:32,228 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s. 2022-03-07 14:30:32,232 - pegasus.tools.visualization - INFO - Using umap kNN graph because number of cells 2229 is smaller than 4096 or knn_indices is not provided. UMAP(min_dist=0.5, random_state=0, verbose=True) Mon Mar 7 14:30:32 2022 Construct fuzzy simplicial set
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Mon Mar 7 14:30:35 2022 Finding Nearest Neighbors Mon Mar 7 14:30:38 2022 Finished Nearest Neighbor Search Mon Mar 7 14:30:41 2022 Construct embedding
Epochs completed: 0%| 0/500 [00:00]
Mon Mar 7 14:30:46 2022 Finished embedding 2022-03-07 14:30:46,865 - pegasus.tools.visualization - INFO - Function 'umap' finished in 14.64s.
Run the code below to show the UMAP plot of cells colored by cluster labels generated by Leiden algorithm on their PCA embedding:
pg.scatter(data, attrs='leiden_labels')
Run the function below to perform Differential Expression analysis:
pg.de_analysis(data, cluster='leiden_labels')
2022-03-07 14:30:47,914 - pegasus.tools.diff_expr - INFO - CSR matrix is converted to CSC matrix. Time spent = 0.3630s. 2022-03-07 14:31:04,257 - pegasus.tools.diff_expr - INFO - MWU test and AUROC calculation are finished. Time spent = 16.3431s. 2022-03-07 14:31:04,329 - pegasus.tools.diff_expr - INFO - Sufficient statistics are collected. Time spent = 0.0722s. 2022-03-07 14:31:04,422 - pegasus.tools.diff_expr - INFO - Differential expression analysis is finished. 2022-03-07 14:31:04,423 - pegasus.tools.diff_expr - INFO - Function 'de_analysis' finished in 16.87s.
This is to annotate cluster-specific cell types with preset mouse brain markers:
celltype_dict = pg.infer_cell_types(data, markers = 'mouse_brain')
celltype_dict
{'1': [name: GABAergic neuron; score: 0.98; average marker percentage: 87.03%; strong support: (Gad1+,94.86%),(Gad2+,89.07%); weak support: (Slc32a1+,77.17%)],
'2': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 90.70%; strong support: (Slc17a7+,100.00%),(Neurod6+,83.39%),(Neurod2+,88.70%),
name: GABAergic neuron; score: 0.58; average marker percentage: 84.88%; strong support: (Gad1+,94.35%); weak support: (Slc32a1+,75.42%)],
'3': [name: GABAergic neuron; score: 1.00; average marker percentage: 93.95%; strong support: (Gad1+,95.16%),(Gad2+,95.97%),(Slc32a1+,90.73%),
name: Ependyma; score: 0.66; average marker percentage: 23.39%; weak support: (Ccdc153+,23.39%)],
'4': [name: Oligodendrocyte; score: 1.00; average marker percentage: 96.65%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,100.00%),(Olig1+,99.52%),(Olig2+,85.17%),(Sox10+,95.22%),
name: Microglia; score: 0.94; average marker percentage: 72.85%; strong support: (C1qb+,81.34%),(Ctss+,81.34%),(Csf1r+,77.51%); weak support: (P2ry12+,51.20%),
name: Choroid Coch; score: 0.73; average marker percentage: 18.66%; weak support: (Tgfbi+,18.66%)],
'5': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 94.69%; strong support: (Slc17a7+,100.00%),(Neurod6+,94.69%),(Neurod2+,89.37%)],
'6': [name: Oligodendrocyte; score: 0.99; average marker percentage: 86.11%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,94.61%),(Olig1+,98.53%); weak support: (Olig2+,56.86%),(Sox10+,66.67%)],
'7': [name: GABAergic neuron; score: 1.00; average marker percentage: 89.96%; strong support: (Gad1+,90.34%),(Gad2+,95.45%),(Slc32a1+,84.09%),
name: Oligodendrocyte; score: 1.00; average marker percentage: 91.38%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,96.59%),(Olig1+,99.43%),(Olig2+,69.89%),(Sox10+,82.39%),
name: Astrocyte; score: 0.50; average marker percentage: 93.47%; strong support: (Aqp4+,89.20%),(Gja1+,97.73%)],
'8': [name: Astrocyte; score: 0.68; average marker percentage: 81.60%; strong support: (Gja1+,98.05%); weak support: (Aqp4+,85.06%),(F3+,61.69%)],
'9': [name: Glutamatergic neuron; score: 0.50; average marker percentage: 91.84%; strong support: (Slc17a7+,100.00%),(Neurod2+,83.67%)],
'10': [name: Mural; score: 1.00; average marker percentage: 73.94%; strong support: (Rgs5+,80.99%),(Acta2+,66.90%),
name: Choroid Coch; score: 1.00; average marker percentage: 40.85%; strong support: (Tgfbi+,40.85%),
name: Ependyma; score: 1.00; average marker percentage: 51.41%; strong support: (Ccdc153+,51.41%),
name: Smooth muscle cell; score: 1.00; average marker percentage: 67.96%; strong support: (Vtn+,87.32%),(Colec12+,48.59%),
name: Astrocyte; score: 0.96; average marker percentage: 79.58%; strong support: (Aqp4+,90.14%),(Gja1+,95.07%); weak support: (F3+,68.31%),(Prex2+,64.79%),
name: Endothelial; score: 0.83; average marker percentage: 57.39%; strong support: (Dcn+,69.01%),(Id1+,86.62%); weak support: (Flt1+,59.86%),(Xdh+,14.08%),
name: Microglia; score: 0.73; average marker percentage: 88.03%; strong support: (C1qb+,94.37%),(Ctss+,87.32%); weak support: (Csf1r+,82.39%),
name: OPC; score: 0.50; average marker percentage: 59.15%; strong support: (Pdgfra+,59.15%),
name: Fibroblast; score: 0.50; average marker percentage: 69.01%; strong support: (Dcn+,69.01%)],
'11': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 92.05%; strong support: (Slc17a7+,100.00%),(Neurod6+,84.62%),(Neurod2+,91.54%)]}
cluster_names = pg.infer_cluster_names(celltype_dict)
pg.annotate(data, name='anno', based_on='leiden_labels', anno_dict=cluster_names)
pg.scatter(data, 'anno', legend_loc='on data')
Besides the UMAP plot, Pegasus also provides spatial function to generate spatial plots. Below is the spatial plot of cells colored by their cell types:
pg.spatial(data, 'anno')